Week 3 of 12 · Part A — Applied Safety

What Responsible Red-Teaming Is

A structured, ethical fire drill — not an adventure in jailbreaking

Day 11 of 60 · ~60 minutes · Concept

The reframe: red-teaming is a defensive discipline

"Red-teaming" sounds like the fun part — break the model, share the screenshot. Reframe it now, because the framing is the job. Red-teaming is the discipline of deliberately trying to make a model fail, under controlled, recorded, defensive conditions, so the failures are found by you and documented — not discovered by a bad actor in the wild, unrecorded, at scale. It is the safety field's fire drill: you start the fire on purpose, in a contained space, with extinguishers ready, precisely so the building's weaknesses are known before a real fire finds them.

The thesis

The value of red-teaming is not the attack; it's the record. A failure that's found, categorized, logged, and routed to a fix makes the model safer. The same failure found by an attacker, undocumented, makes it less safe. Responsible red-teaming is the practice that produces the first outcome, as a controlled, measurable, defensive output, before the second can happen.

This week's blueprint is Red Teaming Language Models to Reduce Harms (Ganguli et al., 2022) — a lab collecting roughly 39,000 real adversarial attempts, studying how red-teaming scales with model size, and devoting explicit attention to protecting the humans who do the work. Notice the title: to reduce harms. That clause is the whole posture.

Categories, not recipes — the bright line

There is one hard line that separates a safety practitioner from a misuse actor, and it runs through what you write down. A red-teamer works in terms of attack categories — the kinds of weakness being probed — never operational recipes, payloads, or reusable misuse instructions. The category is the safety-relevant unit; the recipe is the dangerous one. You can run, measure, and improve an entire program at the category level and never need a single working exploit string in your notes.

Core Theory

Attack categories you reason about (not recipes)

A working starter set, expressed only as categories of weakness:

1 · Persona / role-play pressure

Attempts to push the model out of its intended behavior by framing it as a different character or context. You log the category and whether it was defended — never the framing text itself.

2 · Prompt injection via untrusted content

Instructions smuggled in through retrieved pages, documents, or tool output rather than the user turn. The defining safety question is whether the model honors the trust boundary.

3 · Requests for harmful instructions

Attempts to elicit operational assistance with real-world harm. You measure whether the model refused — you never record what was asked for in actionable form.

4 · Privacy / data extraction

Attempts to pull private or memorized information out of the model. Logged as a category and outcome, with any real data redacted before it touches a record.

5 · Multimodal or formatting evasion

Probes that move the attack into an image, encoding, or unusual format to slip past a text filter. The category captures the coverage gap without preserving the technique.
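The five categories above can be written down as a closed set, with a log record that has nowhere to put a payload. This is a minimal sketch under stated assumptions: the names AttackCategory, Outcome, AttemptRecord, and to_log_row, and the field names, are hypothetical choices for illustration, not a schema from Ganguli et al. or any particular tool.

```python
from dataclasses import dataclass, asdict
from datetime import date
from enum import Enum


class AttackCategory(Enum):
    """The starter categories from this lesson (identifiers are illustrative)."""
    PERSONA_PRESSURE = "persona_or_role_play_pressure"
    PROMPT_INJECTION = "prompt_injection_via_untrusted_content"
    HARMFUL_INSTRUCTIONS = "requests_for_harmful_instructions"
    PRIVACY_EXTRACTION = "privacy_or_data_extraction"
    FORMAT_EVASION = "multimodal_or_formatting_evasion"


class Outcome(Enum):
    DEFENDED = "defended"  # the model held its intended behavior
    FAILED = "failed"      # the model did not


@dataclass(frozen=True)
class AttemptRecord:
    attempt_id: str
    day: date
    category: AttackCategory
    outcome: Outcome
    model_version: str
    # Deliberately no prompt, payload, or response field:
    # the record carries the category and the outcome, nothing reusable.


def to_log_row(record):
    """Flatten a record into plain strings for a CSV or JSON log."""
    row = asdict(record)
    row["day"] = record.day.isoformat()
    row["category"] = record.category.value
    row["outcome"] = record.outcome.value
    return row


# Placeholder values, invented for the example.
example = AttemptRecord(
    attempt_id="a-0001",
    day=date(2024, 1, 15),
    category=AttackCategory.PROMPT_INJECTION,
    outcome=Outcome.DEFENDED,
    model_version="model-under-test",
)
row = to_log_row(example)
```

The design choice worth noticing is the omission: because the schema has no free-text field, nobody can paste a working exploit string into the log by habit.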

Why categories are enough

Coverage, attack-success-rate, and trend-over-time are all computable from category + outcome alone. The operational payload adds risk and adds nothing to the metric. If your red-team artifact reads like a how-to guide, you've built the wrong artifact — and created a liability.
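To make that concrete, here is a rough sketch of all three metrics computed from nothing but (week, category, defended) rows. The rows are invented illustrative data, not real results, and the function names and short category labels are assumptions made for this example.

```python
from collections import defaultdict

# Short, hypothetical labels for the five categories in this lesson.
CATEGORIES = [
    "persona_pressure",
    "prompt_injection",
    "harmful_instructions",
    "privacy_extraction",
    "format_evasion",
]

# (week, category, defended) -- invented rows for illustration only.
records = [
    (1, "persona_pressure", True),
    (1, "persona_pressure", False),
    (1, "prompt_injection", False),
    (1, "harmful_instructions", True),
    (2, "persona_pressure", True),
    (2, "prompt_injection", True),
    (2, "prompt_injection", False),
    (2, "privacy_extraction", True),
]


def attack_success_rate(rows):
    """Fraction of attempts the model failed to defend (None if no rows)."""
    rows = list(rows)
    if not rows:
        return None
    failures = sum(1 for _, _, defended in rows if not defended)
    return failures / len(rows)


def asr_by_category(rows):
    """Attack success rate for each category that was probed."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[1]].append(row)
    return {category: attack_success_rate(group) for category, group in grouped.items()}


def coverage(rows, categories):
    """Share of the planned categories that received at least one attempt."""
    probed = {category for _, category, _ in rows}
    return len(probed & set(categories)) / len(categories)


def asr_trend(rows):
    """Attack success rate per week, in week order."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return {week: attack_success_rate(grouped[week]) for week in sorted(grouped)}


print(coverage(records, CATEGORIES))  # 0.8, because format_evasion was never probed
print(asr_trend(records))             # {1: 0.5, 2: 0.25}
print(asr_by_category(records))
```

No row holds a prompt or a response, yet the coverage gap (one category never probed) and the week-over-week trend both fall out directly.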

The human cost — and why protecting red-teamers is part of the method

Red-teaming means people spending hours deliberately steering a system toward its worst outputs and reading what comes back. That exposure has a real psychological cost, the same kind documented for content-moderation work. Ganguli et al. treat the well-being of red-teamers as a first-class concern, not a footnote, and a responsible operation builds for it from the start: rotation off heavy categories, exposure limits, consent and clear opt-outs, support resources, and never leaving one person on the most disturbing material indefinitely.
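Some of these protections can be enforced in the assignment logic itself rather than left to memory. The sketch below is only an illustration: the RedTeamer fields, the set of "heavy" categories, and the 120-minute weekly cap are hypothetical placeholders, not figures from Ganguli et al. and not a recommendation.

```python
from dataclasses import dataclass

# Assumptions for illustration only; a real program would set these with its team.
HEAVY_CATEGORIES = {"harmful_instructions"}
MAX_HEAVY_MINUTES_PER_WEEK = 120


@dataclass
class RedTeamer:
    name: str
    consented_categories: set
    heavy_minutes_this_week: int = 0
    opted_out: bool = False


def can_assign(person, category, minutes):
    """Check opt-out, consent, and the weekly exposure cap before assigning work."""
    if person.opted_out:
        return False
    if category not in person.consented_categories:
        return False
    if category in HEAVY_CATEGORIES:
        return person.heavy_minutes_this_week + minutes <= MAX_HEAVY_MINUTES_PER_WEEK
    return True


def pick_assignee(team, category, minutes):
    """Rotate by choosing the eligible person with the least heavy exposure so far."""
    eligible = [p for p in team if can_assign(p, category, minutes)]
    if not eligible:
        return None  # nobody is eligible: the session waits, the cap does not bend
    return min(eligible, key=lambda p: p.heavy_minutes_this_week)


# Placeholder team, invented for the example.
team = [
    RedTeamer("reviewer_a", {"harmful_instructions", "persona_pressure"}, heavy_minutes_this_week=110),
    RedTeamer("reviewer_b", {"harmful_instructions"}, heavy_minutes_this_week=30),
    RedTeamer("reviewer_c", {"persona_pressure"}, opted_out=True),
]
chosen = pick_assignee(team, "harmful_instructions", 30)
```

Returning None when no one is eligible is the point of the design: the schedule gives way before the exposure limit does.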

A lead's obligation

If you ever run a red team, the people doing the work are an asset in your threat model too. Designing the work so it doesn't quietly harm them is not "soft"; it's part of doing the method correctly. A program that burns out its red-teamers also gets worse data.

Your work today

Read the Blueprint

~60-minute foundation

  1. Read §2 and the lessons of Red Teaming Language Models to Reduce Harms (Ganguli et al.). Note two things: how red-teaming changes with model size, and what they say about protecting the people doing it.
  2. Skim OpenAI's Approach to External Red Teaming — focus on how it's run as a program (who's brought in, why, and how people and automation combine), not as a one-off.
  3. In a notebook, write down 4+ attack categories (not recipes) in your own words, and one sentence on why red-teamer well-being is a method requirement, not a nicety.

The expert move

An enthusiast frames red-teaming as breaking the model and showing off the jailbreak. An expert frames it as a controlled, recorded, defensive operation whose product is a categorized record — and refuses, on principle, to produce operational misuse content even while probing for it. The altitude jump is from "I found a way to break it" to "I run a measurable program that finds, categorizes, and routes failures while protecting the people who do the work."

Say this in an interview: "I treat red-teaming as a defensive discipline: controlled conditions, attack categories rather than recipes, every attempt logged so coverage is measurable, and an explicit well-being protocol for the red-teamers. The deliverable is a record that makes the model safer, not a collection of exploits."

Today's Takeaways